Milestone 1

Overview

Important Questions

Context

Why is this problem important to solve?

Climate change's threat looms ever nearer the entire globe. According to NASA, climate change will cause flooding, drought, heat waves, stronger hurricanes, and a higher sea level in the US. While some climate changes occur naturally, humanity's CO2 emissions already caused climate change that would not have happend otherwise. Susan Solomon et al states the concentration of CO2 in our atmosphere is already so large that the effects will be irreversable for 1,000 years even if all emissions were to stop today. In order to continue enjoying this beautiful planet in the long run, humanity must manage its resources more efficiently.

Objective

What is the intended goal?

The US Environmental Protection Agency estimates that approximately 25% of all greenhouse gas emissions (including CO2) come from electricity production. The same EPA article states that the US produces 62% of its energy by burning fossil fuels such as coal and natural gas. Predicting future CO2 emissions from different energy sources may help policy makers and innovators understand where to focus their green energy efforts.

Key questions

What are the key questions that need to be answered?

  • What energy sources emit the most CO2?
  • What energy sources emit the least CO2?
  • Which energy sources are expected to emit CO2 at an even faster rate in the near future?
  • Which energy sources should emit less CO2 as time goes on without intervention?

Problem Formulation

What is it that we are trying to solve using data science?

The purpose of this project is to identify potential opportunities to prevent the acceleration of CO2 emission rates from electricity generation.

Data Dictionary

This datset is the past monthly data of Carbon dioxide emissions from electricity generation from the US Energy Information Administration categorized by fuel type such as Coal, Natural gas etc.

  • MSN - Reference to Mnemonic Series Names (U.S. Energy Information Administration Nomenclature)
  • YYYYMM - The month of the year on which these emissions were observed
  • Value - Amount of CO2 Emissions in Million Metric Tons of Carbon Dioxide
  • Description - Different category of electricity production through which carbon is emissioned.

Data Preparation

Before we visualize and model our data, we must make sure there are no null or otherwise invalid arguments and format the data into a dataframe.

In [18]:
# Set up Google Colab.
from google.colab import drive
drive.mount('/content/drive')

# Avoid scroll-in-the-scroll for large outputs.
from IPython.display import Javascript
def resize_colab_cell():
  display(Javascript('google.colab.output.setIframeHeight(0, true, {maxHeight: 5000})'))
get_ipython().events.register('pre_run_cell', resize_colab_cell)
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [19]:
# Import basic libraries.
!pip install statsmodels --upgrade
import pandas as pd
import warnings
import itertools
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from statsmodels.graphics import tsaplots
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from statsmodels.tsa.stattools import coint, adfuller
from statsmodels.tsa.ar_model import AutoReg
warnings.filterwarnings("ignore")
Requirement already satisfied: statsmodels in /usr/local/lib/python3.7/dist-packages (0.13.1)
Requirement already satisfied: scipy>=1.3 in /usr/local/lib/python3.7/dist-packages (from statsmodels) (1.4.1)
Requirement already satisfied: pandas>=0.25 in /usr/local/lib/python3.7/dist-packages (from statsmodels) (1.1.5)
Requirement already satisfied: patsy>=0.5.2 in /usr/local/lib/python3.7/dist-packages (from statsmodels) (0.5.2)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from statsmodels) (1.19.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.25->statsmodels) (2.8.2)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.25->statsmodels) (2018.9)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from patsy>=0.5.2->statsmodels) (1.15.0)
In [20]:
# Load the data.
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/MER_T12_06.xlsx')

# Convert dates into standard datetime
df['YYYYMM'] = pd.to_datetime(df['YYYYMM'], errors = 'coerce', format = '%Y%m')

# Drop rows with NaT in date column and set date as index.
df = df.dropna(subset = ['YYYYMM']).set_index('YYYYMM')
df.head(15)
Out[20]:
MSN Value Description
YYYYMM
1973-01-01 CLEIEUS 72.076 Coal Electric Power Sector CO2 Emissions
1973-02-01 CLEIEUS 64.442 Coal Electric Power Sector CO2 Emissions
1973-03-01 CLEIEUS 64.084 Coal Electric Power Sector CO2 Emissions
1973-04-01 CLEIEUS 60.842 Coal Electric Power Sector CO2 Emissions
1973-05-01 CLEIEUS 61.798 Coal Electric Power Sector CO2 Emissions
1973-06-01 CLEIEUS 66.538 Coal Electric Power Sector CO2 Emissions
1973-07-01 CLEIEUS 72.626 Coal Electric Power Sector CO2 Emissions
1973-08-01 CLEIEUS 75.181 Coal Electric Power Sector CO2 Emissions
1973-09-01 CLEIEUS 68.397 Coal Electric Power Sector CO2 Emissions
1973-10-01 CLEIEUS 67.668 Coal Electric Power Sector CO2 Emissions
1973-11-01 CLEIEUS 67.021 Coal Electric Power Sector CO2 Emissions
1973-12-01 CLEIEUS 71.118 Coal Electric Power Sector CO2 Emissions
1974-01-01 CLEIEUS 70.55 Coal Electric Power Sector CO2 Emissions
1974-02-01 CLEIEUS 62.929 Coal Electric Power Sector CO2 Emissions
1974-03-01 CLEIEUS 64.519 Coal Electric Power Sector CO2 Emissions
In [21]:
# Check the info of the dataframe.
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4707 entries, 1973-01-01 to 2016-07-01
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   MSN          4707 non-null   object
 1   Value        4707 non-null   object
 2   Description  4707 non-null   object
dtypes: object(3)
memory usage: 147.1+ KB

Observations:

  • Our dataframe has 3 object columns with no missing data.
  • There are 4,707 observations even after deleting observations with invalid dates.
  • Since the Value column is supposed to be a float, we need to convert numeric data in this column to a float and delete all rows with invalid values that cannot be converted to a float.
In [22]:
# Convert valid values in the Value column to float and drop any rows with invalid values in said column.
# Check the info of the dataframe.
df['Value'] = pd.to_numeric(df['Value'], errors = 'coerce', downcast = 'float')
df = df.dropna(subset = ['Value'])
df.head(15)
Out[22]:
MSN Value Description
YYYYMM
1973-01-01 CLEIEUS 72.075996 Coal Electric Power Sector CO2 Emissions
1973-02-01 CLEIEUS 64.442001 Coal Electric Power Sector CO2 Emissions
1973-03-01 CLEIEUS 64.084000 Coal Electric Power Sector CO2 Emissions
1973-04-01 CLEIEUS 60.841999 Coal Electric Power Sector CO2 Emissions
1973-05-01 CLEIEUS 61.798000 Coal Electric Power Sector CO2 Emissions
1973-06-01 CLEIEUS 66.538002 Coal Electric Power Sector CO2 Emissions
1973-07-01 CLEIEUS 72.625999 Coal Electric Power Sector CO2 Emissions
1973-08-01 CLEIEUS 75.181000 Coal Electric Power Sector CO2 Emissions
1973-09-01 CLEIEUS 68.397003 Coal Electric Power Sector CO2 Emissions
1973-10-01 CLEIEUS 67.667999 Coal Electric Power Sector CO2 Emissions
1973-11-01 CLEIEUS 67.021004 Coal Electric Power Sector CO2 Emissions
1973-12-01 CLEIEUS 71.117996 Coal Electric Power Sector CO2 Emissions
1974-01-01 CLEIEUS 70.550003 Coal Electric Power Sector CO2 Emissions
1974-02-01 CLEIEUS 62.929001 Coal Electric Power Sector CO2 Emissions
1974-03-01 CLEIEUS 64.518997 Coal Electric Power Sector CO2 Emissions

Observations:

  • Our dataframe now has 2 object columns and 1 float32 column.
  • There are now 4,323 observations with no missing data.
  • We are now ready to visualize our data!

Dataset Visualization

Further Formatting

Since the dataset has 8 different sources of carbon emissions, we first need to group by the description. Then we will be able to compare the sources to each other in our visualizatiuons. Once we do so, we can drop the MSN column from the dataset.

In [23]:
# Create a dataframe that groups by description and uses the description as the column names.
EnergyTypes = ['Coal', 'Distillate Fuel', 'Geothermal Energy', 'Natural Gas', 'Non-Biomass Waste',
              'Petroleum Coke', 'Petroleum', 'Residual Fuel Oil', 'Total Emissions']
Emissions = pd.DataFrame(index = df.index.unique(), columns = EnergyTypes)
grouped = df.groupby('Description')

for g, c in zip(grouped.groups, EnergyTypes):
    Emissions[c] = grouped.get_group(g).iloc[:, 1]

Emissions.head(15)
Out[23]:
Coal Distillate Fuel Geothermal Energy Natural Gas Non-Biomass Waste Petroleum Coke Petroleum Residual Fuel Oil Total Emissions
YYYYMM
1973-01-01 72.075996 2.375 NaN 12.175000 NaN 0.128 27.368999 24.867001 111.621002
1973-02-01 64.442001 2.061 NaN 11.708000 NaN 0.106 23.034000 20.867001 99.184998
1973-03-01 64.084000 1.171 NaN 13.994000 NaN 0.083 21.034000 19.780001 99.112000
1973-04-01 60.841999 1.022 NaN 14.627000 NaN 0.130 17.714001 16.562000 93.182999
1973-05-01 61.798000 0.949 NaN 17.344000 NaN 0.167 18.870001 17.754000 98.012001
1973-06-01 66.538002 1.787 NaN 20.264999 NaN 0.171 22.856001 20.899000 109.660004
1973-07-01 72.625999 1.965 NaN 23.066999 NaN 0.150 25.242001 23.127001 120.934998
1973-08-01 75.181000 2.536 NaN 22.850000 NaN 0.166 27.158001 24.455999 125.189003
1973-09-01 68.397003 1.432 NaN 19.297001 NaN 0.116 23.757999 22.209999 111.452003
1973-10-01 67.667999 1.473 NaN 17.896999 NaN 0.102 23.726000 22.150999 109.291000
1973-11-01 67.021004 1.537 NaN 13.766000 NaN 0.087 22.837000 21.212999 103.624001
1973-12-01 71.117996 1.606 NaN 11.866000 NaN 0.152 22.104000 20.346001 105.088997
1974-01-01 70.550003 1.835 NaN 11.959000 NaN 0.183 23.021000 21.003000 105.529999
1974-02-01 62.929001 1.837 NaN 11.021000 NaN 0.148 19.976999 17.992001 93.928001
1974-03-01 64.518997 1.868 NaN 13.860000 NaN 0.156 19.478001 17.454000 97.857002

Observations:

  • The Geothermal Energy and Non-Biomass Waste columns don't have any observations until 1989.
  • We will have to drop all rows that have any nulls in order for comparisons to be over the same timeframes and make sense.
In [24]:
# Drop rows with null values in Emissions.
EmissionsDropped = Emissions.dropna()

# View information on our new dataframe.
EmissionsDropped.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 331 entries, 1989-01-01 to 2016-07-01
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Coal               331 non-null    float32
 1   Distillate Fuel    331 non-null    float32
 2   Geothermal Energy  331 non-null    float32
 3   Natural Gas        331 non-null    float32
 4   Non-Biomass Waste  331 non-null    float32
 5   Petroleum Coke     331 non-null    float32
 6   Petroleum          331 non-null    float32
 7   Residual Fuel Oil  331 non-null    float32
 8   Total Emissions    331 non-null    float32
dtypes: float32(9)
memory usage: 14.2 KB

Observations:

  • We now have 331 months of observations for each of the 8 energy types as well as total emissions.
  • There are no nulls in our dataset.
In [25]:
# View the head of our data.
EmissionsDropped.head(15)
Out[25]:
Coal Distillate Fuel Geothermal Energy Natural Gas Non-Biomass Waste Petroleum Coke Petroleum Residual Fuel Oil Total Emissions
YYYYMM
1989-01-01 132.233002 0.898 0.031 8.890000 0.371 0.145 12.754000 11.711 154.279007
1989-02-01 124.345001 1.061 0.028 10.420000 0.335 0.100 14.688000 13.526 149.815002
1989-03-01 122.803001 1.176 0.031 12.758000 0.371 0.107 13.849000 12.565 149.811005
1989-04-01 111.194000 0.457 0.030 14.187000 0.359 0.116 9.644000 9.071 135.412994
1989-05-01 115.915001 0.665 0.031 15.128000 0.371 0.109 8.498000 7.723 139.942993
1989-06-01 126.030998 0.905 0.030 15.696000 0.359 0.117 10.648000 9.625 152.763000
1989-07-01 138.082993 0.953 0.031 19.290001 0.371 0.179 10.452000 9.320 168.227005
1989-08-01 139.613998 0.669 0.031 18.679001 0.371 0.177 9.411000 8.565 168.106003
1989-09-01 124.594002 0.667 0.030 16.153999 0.359 0.165 8.652000 7.820 149.789001
1989-10-01 119.943001 0.516 0.031 15.327000 0.371 0.121 7.040000 6.403 142.712997
1989-11-01 120.823997 0.649 0.030 11.410000 0.359 0.101 9.630000 8.880 142.253006
1989-12-01 143.264008 2.528 0.031 10.349000 0.371 0.152 18.434999 15.755 172.449997
1990-01-01 132.923004 0.589 0.033 9.182000 0.492 0.153 9.988000 9.246 152.617004
1990-02-01 116.260002 0.464 0.029 8.295000 0.445 0.236 8.165000 7.465 133.194000
1990-03-01 121.806000 0.436 0.033 11.590000 0.492 0.236 8.793000 8.122 142.714005

Observations:

  • Coal often contributes a whole order of magnitude more carbon emissions than all the other energy sources combined.
  • Petroleum, Residual Fuel Oil, and Natural Gas all contribute the most after coal to carbon emissions consistently; often an order of magnitude more.
  • All of the energy types seem to have peak emissions around July/August and then start to decrease. Later when we check for stationarity, we should check and make sure this isn't a seasonal pattern.

Chart Time

Now that the dataframe is in a more user-friendly format, we are ready to create the different plots and graphs.

Box Plots

In [26]:
# Create boxplots to visualize each energy source's emissions over this timeframe.
EmissionsDropped.boxplot(figsize=(24, 12), grid = False, fontsize=12, color = 'orange');

Observations:

  • Coal has the widest distribution - therefore the greatest standard deviation - of all the energy types.
  • Coal is slightly left-skewed.
  • Distillate Fuyel, Natural Gas, Petroleum Coke, Petroleum, and Residual Fuel Oil are all slighlty right-skewed.
  • Coal is by far the greatest contributor to carbon emissions.

Bar Chart

In [27]:
# Compare total emissions from each energy source in a bar plot.
EmissionsDropped.sum().plot.bar(figsize=(24, 12), fontsize=12, ylabel = 'CO2 Emissions (MMT)', xlabel = 'Energy Type', title = '1989 - 2016 CO2 Emissions by Energy Type');

Observations:

  • Coal alone contriputes over 3/4 of total CO2 emissions for this time period. Either we heavily rely on coal to generate energy and/or it is highly inefficient.
  • Natural gas is the 2nd-greatest contributor. This makes sense given that coal is often used for industrial purposes and most homes and apartments use natural gas for heat and cooking.
  • Geothermal energy emits the least amount of CO2. This is probably because geothermal energy is a newer technology, but it may also get more energy per MMT of CO2 emissions.

Line Graph of Each Source Over Time

In [28]:
# Show the relationship between power generation carbon emissions and time in a line graph.
EmissionsDropped.plot(subplots = True, figsize = (24,24), xlabel = 'Year', ylabel = 'Carbon Emissions (MMT)');

Observations:

  • While coal may possibly have a constant mean and standard deviation, it has very strong seasonality which makes it non-stationary. The relatively constant mean indicates that coal has emitted roughly the same amount of carbon emissions over time.
  • Distillate fuel may be stationary. There is no obvious seasonal pattern, the mean tends to stay constant when not spiking, and the standard deviation doesn't appear to change much. This may be wrong, depending on the impact of the sudden random spikes. Distillate fuel hasn't deviated much from about .25 MMT of carbon emissions every month over time.
  • While geothermal's standard deviation appears constant and maybe the mean is constant, there is some strong seasonality. Overall, geothermal energy emits about .005 MMT more carbon emissions in 2016 than it did in 1989.
  • Natural gas has some seasonality, a slightly increasing trend, and greater standard deviation with time. Natural gas emits about 53 MMT more carbon in 2016 than it did in 1989.
  • Non-biomass waste has seasonality and a moving mean, but a constant standard deviation. The moving mean indicates an increase in carbon emissions of about .6 MMT.
  • While petroleum coke doesn't appear to have a seasonal pattern, its standard deviation is greater with time and the mean appears to increase over time; it is emitting about .9 MMT more every month.
  • Petroleum has no obvious seasonality, but the standard deviation and mean both decrease. It went from about 14 MMT a month in 1989 to about 1 MMT a month in 2016 for its carbon emissions.
  • Residual fuel oil looks like petroleum no obvious seasonality and a decreasing standard deviation and mean. Residual fuel oil went from 12 MMT of CO2 in 1989 to about 1 MMT in 2016.
  • Overall CO2 emissions appear to have a roughly constant mean around 175 MMT per month from 1989 to 2016. It does appear to have strong seasonality and the standard deviation increased slightly over time.

Rolling Averages and Standard Deviations Over Time

In [29]:
# Now we will view the rolling averages for all of the energy sources for the first 100 observations.
# Calculating the rolling mean and standard deviation for a window of 10 observations.
for c in EnergyTypes:
  rolmean=EmissionsDropped[c].rolling(window=10).mean()
  rolstd=EmissionsDropped[c].rolling(window=10).std()

#Visualizing the rolling mean and standard deviation

  plt.figure(figsize=(16,8))
  actual = plt.plot(EmissionsDropped[c], color='grey', label=c)
  rollingmean = plt.plot(rolmean, color='blue', label='Rolling Mean') 
  rollingstd = plt.plot(rolstd, color='purple', label='Rolling Std. Dev.')
  plt.title('Rolling Mean & Standard Deviation')
  plt.legend()
  plt.show()

Observations

  • With seasonality in the rolling mean and a downward trend, coal is not a stationary series.
  • Distillate fuel is difficult to tell if it is stationary due to the random spikes. The slight downward trend and changes in standard deviation indicate it is not a stationary series.
  • Geothermal's strong seasonality keeps it from being a stationary series.
  • Natural gas has an upward trend as well as a seasonal mean and standard deviation so it is not a stationary series.
  • Non-biomass's upward trend and seasonality make it non-stationary.
  • Petroleum Coke is probably not stationary due to the overall increase in its rolling mean.
  • Petroleum's decreasing mean and slightly decreasing standard deviation keep it from being stationary.
  • Residual oil is not stationary due to its decreasing standard deviation and rolling mean.
  • Total emissions has seasonality in its rolling mean and a slight increase in standard deviation so it is not stationary.

Forcasting Preparation

For this project, we are only interested in analyzing natural gas CO2 emissions. Therefore, we need a dataframe which contains only the date as an index and the natural gas observations. When we went to visualize the data, we had to drop all of the observations measured before 1989 since some of the energy sources did not have data until then. However, natural gas has no missing values from the start of the original dataset, so we are going to use this to make our forcasting dataframe. Luckily, we still have observations from the entire daterange in our 'Emissions' dataframe!

In [30]:
# Create a new, natural gas-only dataframe based on the 'Emissions' dataframe.
NatGas = pd.DataFrame()
NatGas['Value'] = Emissions['Natural Gas']
NatGas.head(15)
Out[30]:
Value
YYYYMM
1973-01-01 12.175000
1973-02-01 11.708000
1973-03-01 13.994000
1973-04-01 14.627000
1973-05-01 17.344000
1973-06-01 20.264999
1973-07-01 23.066999
1973-08-01 22.850000
1973-09-01 19.297001
1973-10-01 17.896999
1973-11-01 13.766000
1973-12-01 11.866000
1974-01-01 11.959000
1974-02-01 11.021000
1974-03-01 13.860000
In [31]:
# View the dataframe's information.
NatGas.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 523 entries, 1973-01-01 to 2016-07-01
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Value   523 non-null    float32
dtypes: float32(1)
memory usage: 6.1 KB

Observations:

  • This new dataframe has a single column of floats.
  • There are 523 observations and no nulls in this series.

Milestone 1 Conclusion - Proposed Approach

  • Potential techniques:
    • To establish stationarity:
      • Augmented Dickey-Fuller Test
      • Logarithmic transformation
      • Differencing the series
      • Decompose the series into its trend, seasonality, and residual
    • To model the shifted data:
      • Use autocorrelation and partial correlation to determine p and q values
      • Split the data 80/20 to create training and testing datasets, respectively
      • Create the AR, MA, ARMA, and ARIMA models for the data
      • Calculate the RMSE for each potential model
      • Use inverse transformation to see how the model fares against the test data
  • Overall solution design: The end solution should focus on only one of the models as determined by which one is the most successful.
  • Measures of success:
    • A low AIC value
    • A low mean squared error
    • No overfitting; a similar RMSE between the training and testing data.

Milestone 2

Splitting into training and testing data.

In [32]:
# Splitting the dataset into 80% train and 20% test and visualizing them on the same plot.
train, test = train_test_split(NatGas, test_size=.20, train_size=.80, random_state=10, shuffle=False);

fig, ax = plt.subplots(figsize = (24, 8));
train.plot(ax = ax, color = 'grey');
test.plot(ax = ax, color = 'purple');
plt.legend(['Train Data', 'Test Data']);

Establishing Stationarity

Next, we will implement the Augmented Dicky-Fuller Test to test for stationarity. The test has the following conditions:

  • Null Hypothesis - the time series is non stationary due to seasonality and trend.
  • Alternate Hypothesis - the time series is stationary.
  • We are setting alpha at .05.
In [33]:
# We are going to visualize multiple plots of rolling means and standard deviations against the data.
# We will also use the Augmented Dicky-Fuller test more than once. To make this easier, we're going to make a function.

def MakePlot(x):
  # Visualize the rolling mean and standard deviation with window = 12.
  plt.figure(figsize = (24, 8))
  plt.plot(x, color = 'gray', label = 'Values')
  plt.plot(x.rolling(window = 12).mean(), color = 'blue', label = 'Mean', linestyle = '--')
  plt.plot(x.rolling(window = 12).std(), color = 'purple', label = 'Std. Dev.', linestyle = ':')
  plt.legend()
  plt.show()

  # Implement ADF test on the log data.
  print(adfuller(x['Value']))

Testing for Stationarity

In [34]:
# Visualizing the rolling mean and standard deviation and performing the ADF test on the training data.
MakePlot(train)
(1.3149120124899205, 0.9966911379959049, 17, 400, {'1%': -3.4468044036406247, '5%': -2.868792838125, '10%': -2.57063355625}, 1629.204432022423)

Observations:

  • The plot above shows strong seasonality with spikes mid-year and dips at the end and start of each year.
  • It also shows that, overall, carbon emissions from natural gas have been gradually increasing since approximately 1990.
  • The standard deviation is also increasing slightly over time.
  • The series is not stationary so we need to decompose the time series into trend, seasonality, and white noise.
  • Our p-value rounds up to 1 unless we include 6 decimal points. This is much higher than our alpha of .05 so we fail to reject the null hypothesis and say that the time series is not stationary.
In [35]:
train.info()
train.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 418 entries, 1973-01-01 to 2007-10-01
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Value   418 non-null    float32
dtypes: float32(1)
memory usage: 4.9 KB
Out[35]:
Value
YYYYMM
1973-01-01 12.175
1973-02-01 11.708
1973-03-01 13.994
1973-04-01 14.627
1973-05-01 17.344

Visualise the decomposed data.

In [36]:
# Using seasonal_decompose to create a dataframe with the individual components of the decomposed time series.
decomposition = sm.tsa.seasonal_decompose(train.Value)
DecompData = pd.DataFrame()
DecompData['trend'] = decomposition.trend
DecompData['seasonal'] = decomposition.seasonal
DecompData['noise'] = decomposition.resid

# Visualize the components.
fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, figsize = (24,24));
DecompData['trend'].plot(ax=ax1, title='Trend', color='orange');
DecompData['seasonal'].plot(ax=ax2, title='Seasonal', color='purple');
DecompData['noise'].plot(ax=ax3, title='Noise');

Observations:

  • There is an overall upward trend, indicating that there have been more natural gas carbon emissions over time.
  • There is very strong seasonality with such a regular pattern of spiking in the summer and dropping in the winter that it looks intentional at a cursory glance.
  • Noise appears to have a mean of 0, but its standard deviation may be increasing as time goes on.
  • We need to use a combination of differencing and log transformation to create a stationary time series.

Perform logarithmic transformation

In [37]:
# We will use logarithmic transformation to stabilize the standard deviation and then test again for stationarity using ADF.
# Perform the logarithmic transformation.
LogTrain = np.log(train)
print(LogTrain)
# Visualizing the rolling mean and standard deviation and performing the ADF test on the transformed training data.
MakePlot(LogTrain)
               Value
YYYYMM              
1973-01-01  2.499385
1973-02-01  2.460272
1973-03-01  2.638629
1973-04-01  2.682869
1973-05-01  2.853247
...              ...
2007-06-01  3.529825
2007-07-01  3.720862
2007-08-01  3.963837
2007-09-01  3.613455
2007-10-01  3.490428

[418 rows x 1 columns]
(0.2305862802256556, 0.9739258436685548, 17, 400, {'1%': -3.4468044036406247, '5%': -2.868792838125, '10%': -2.57063355625}, -768.4018163672417)

Observations:

While the standard deviation has stabalized, there is still strong seasonality, an upward trend, and a p-value of .97. It is now time to shift the series by one month and apply differencing using the lagged series.

Perform first order differencing.

In [38]:
# Calculate the first order difference.
Diff = LogTrain.diff(periods = 1).dropna()

# Visualizing the rolling mean and standard deviation and performing the ADF test on the differenced training data.
MakePlot(Diff)
(-4.14105786997301, 0.0008277641567700357, 18, 398, {'1%': -3.4468876315017423, '5%': -2.868829424528516, '10%': -2.570653059771218}, -775.95496758175)

Observations:

The graph of the rolling mean and standard deviation are both stabalized and our p-value is only .001. Since this is less than our alpha of .05, we reject the null hypothesis and conclude the series is now stationary. We are ready to begin modeling the data.

In [39]:
# Using seasonal_decompose to create a dataframe with the individual components of the decomposed time series.
decomposition = sm.tsa.seasonal_decompose(Diff.Value)
DecompData = pd.DataFrame()
DecompData['trend'] = decomposition.trend
DecompData['seasonal'] = decomposition.seasonal
DecompData['noise'] = decomposition.resid

# Visualize the components.
fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, figsize = (48,24));
DecompData['trend'].plot(ax=ax1, title='Trend', color='orange');
DecompData['seasonal'].plot(ax=ax2, title='Seasonal', color='purple');
DecompData['noise'].plot(ax=ax3, title='Noise');

Observations:

There still appears to be some strong seasonality in the data, but our ADF p-value is already less than .05. Depending on how the AR, MA, ARMA, and ARIMA models perform, we may have to test other model types.

Build and Test Models

Find Optimal P and Q Parameters

In [40]:
# creating two subplots to show ACF and PACF plots
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(20, 12))

# creating and plotting the ACF charts to lag=12
tsaplots.plot_acf(Diff, zero=False, ax=ax1, lags = 12)

# creating and plotting the ACF charts to lag=12
tsaplots.plot_pacf(Diff, zero=False, ax=ax2, lags = 12)

plt.show()

Observations:

  • The ACF and PACF plots show that the highest lag at which each plot goes past the confidence interval boundary is lag 1.
  • We therefore conclude that p = 1 and q = 1.
In [41]:
#Apply AR model
plt.figure(figsize=(16,8))
model_AR = AutoReg(Diff, lags=1) #Use number of lags as 1 and apply AutoReg function on Diff series
results_AR = model_AR.fit() #fit the model
plt.plot(Diff)
predict = results_AR.predict(start=0,end=len(Diff)-1) #predict the series 
predict = predict.fillna(0) #Converting NaN values to 0
plt.plot(predict, color='red')
plt.title('AR Model - RMSE: %.4f'% mean_squared_error(predict,Diff['Value'], squared=False))  #Calculating rmse
plt.show()
In [42]:
results_AR.aic
Out[42]:
-369.51480330121024

Observations:

So far we are off to a solid start with a low RMSE of 0.1539 and an AIC of about -369.51.

In [46]:
#Apply MA model.
plt.figure(figsize=(16,8))
model_MA =  ARIMA(Diff, order=(0, 0, 1)) #Using p=0, d=0, q=1 and apply ARIMA function on Diff series
results_MA = model_MA.fit() #fit the model
plt.plot(Diff)
plt.plot(results_MA.fittedvalues, color='red')
plt.title('MA Model - RMSE: %.4f'% mean_squared_error(results_MA.fittedvalues,Diff['Value'], squared=False))
plt.show()
In [47]:
results_MA.aic
Out[47]:
-359.52617619428963

Observations:

While the MA model has both a higher RMSE and a higher AIC than the AR model.

In [48]:
# Apply the ARMA model.
plt.figure(figsize=(16,8))
model_ARMA =  ARIMA(Diff, order=(1, 0, 1)) #Using p=1, d=0, q=1 and apply ARIMA function on Diff series
results_ARMA = model_ARMA.fit() #fit the model
plt.plot(Diff)
plt.plot(results_ARMA.fittedvalues, color='red')
plt.title('ARMA Model - RMSE: %.4f'% mean_squared_error(results_ARMA.fittedvalues,Diff['Value'], squared=False))
plt.show()
In [49]:
results_ARMA.aic
Out[49]:
-369.3243098805328

Observations:

This model performed slightly worse than the AR model with an AIC of -369.32.

In [50]:
#Apply the ARIMA model.
plt.figure(figsize=(16,8))
model_ARIMA = ARIMA(Diff, order=(1, 1, 1)) 
results_ARIMA = model_ARIMA.fit() #fit the model
plt.plot(Diff)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('ARIMA Model - RMSE: %.4f'% mean_squared_error(results_ARIMA.fittedvalues,Diff['Value'], squared=False))
plt.show()
In [51]:
results_ARIMA.aic
Out[51]:
-364.2033671349616

Observations:

ARIMA performed the worst of the for with a slightly higher RMSE and the highest AIC of -364.21. We will move forward with the AR model as it has the best RMSE and AIC.

In [66]:
# Printing the fitted values
predictions=pd.Series(results_AR.fittedvalues)

#First step - doing cumulative sum
predictions_cumsum = predictions.cumsum() # use .cumsum fuction on the predictions

#Second step - Adding the first value of the log series to the cumulative sum values
predictions_log = pd.Series()
predictions_log = pd.Series(LogTrain['Value'].iloc[13], index=LogTrain.index)
predictions_log = predictions_log.add(predictions_cumsum, fill_value=0)

#Third step - applying exponential transformation
predictions_AR = np.exp(predictions_log) #use exponential function
In [67]:
#Plotting the original vs predicted series
plt.figure(figsize=(16,8))
plt.plot(train, color = 'c', label = 'Original Series')  #plot the original train series
plt.plot(predictions_AR, color = 'r', label = 'Predicted Series')  #plot the predictions_ARIMA 
plt.title('Actual vs Predicted')
plt.legend()
plt.show()

Observations:

While this model is doing a decent job of capturing the overall trend of the data, it is not accurately reflecting its seasonality. Let's see how it performs against the test data.

In [68]:
#Forecasting the values for the test data
forecasted_AR = pd.DataFrame()
forecasted_AR['forecasted'] = results_MA.forecast(steps=105)

#First step - doing cumulative sum
forecasted_cumsum = forecasted_AR.cumsum() # use .cumsum fuction on the predictions

#Second step - Adding the first value of the log series to the cumulative sum values
forecasted_log = pd.Series()
forecasted_log = pd.Series(LogTrain['Value'].iloc[417], index=forecasted_cumsum.index)
forecasted_log = forecasted_log + forecasted_cumsum['forecasted']

#Applying exponential transformation to the forecasted log values
forecasted_AR = np.exp(forecasted_log) #use exponential function on forecasted data
In [69]:
#Plotting the original vs predicted series
plt.figure(figsize=(16,8))
plt.plot(NatGas, color = 'c', label = 'Original Series')
plt.plot(predictions_AR, color = 'r', label = 'Prediction on Train data') #plot the predictions_ARIMA series
plt.plot(forecasted_AR, label = 'Prediction on Test data', color='b')  #plot the forecasted_ARIMA series
plt.title('Actual vs Predicted')
plt.legend()
plt.show()
In [70]:
mean_squared_error(predictions_AR, train, squared = False) #calculate RMSE using the predictions_ARIMA and df_train 
Out[70]:
4.260615889182347
In [71]:
mean_squared_error(forecasted_AR, test, squared = False)  #calculate RMSE using the forecasted_ARIMA and df_test
Out[71]:
7.807155050302542

Observations:

The graph clearly demonstrates that our model failed to capture the seasonality of the data. This is confirmed by the fact that the forecasted RMSE is nearly double that of the training RMSE. Our model is overfitting.

Build and Test Sarima Model

Identify Optimal Parameters

We are going to brute force test for all combinations of all values of p, d, q, P, D, and Q between 0 and 2. While there may be better parameters out there, this has to be weighed against the time and resources cost of performing the test.

In [58]:
# Create a list of all possible parameter combinations.
p = d = q = range(0, 2)
Vals = list(itertools.product(p, d, q))
SarimaValues = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]

# Perform logarithmic transformation on the original data.
Log = np.log(NatGas)
In [59]:
SarimaAics = []
for Param in Vals:
  for SeasonalParam in SarimaValues:
    try:
      model_SARIMA = sm.tsa.statespace.SARIMAX(Log, order = Param, seasonal_order = SeasonalParam, enforce_stationarity=False, enforce_invertibility=False)
      SarimaResults = model_SARIMA.fit()
      print('ARIMA{}x{} - AIC:{}'.format(Param, SeasonalParam, SarimaResults.aic))
      if SarimaResults.mle_retvals is not None and SarimaResults.mle_retvals['converged'] == False:
        print(SarimaResults.mle_retvals)
        SarimaAics.append(SarimaResults.aic)
    except:
      continue
SarimaAics.sort()
print(SarimaAics[0])
ARIMA(0, 0, 0)x(0, 0, 0, 12) - AIC:2630.9639950446167
ARIMA(0, 0, 0)x(0, 0, 1, 12) - AIC:1941.9225198524941
ARIMA(0, 0, 0)x(0, 1, 0, 12) - AIC:-652.3246618446334
ARIMA(0, 0, 0)x(0, 1, 1, 12) - AIC:-652.4139110337937
ARIMA(0, 0, 0)x(1, 0, 0, 12) - AIC:-671.4316326887068
ARIMA(0, 0, 0)x(1, 0, 1, 12) - AIC:-715.8480710606453
ARIMA(0, 0, 0)x(1, 1, 0, 12) - AIC:-653.7126820551116
ARIMA(0, 0, 0)x(1, 1, 1, 12) - AIC:-651.9113957126955
ARIMA(0, 0, 1)x(0, 0, 0, 12) - AIC:1942.5311171291341
ARIMA(0, 0, 1)x(0, 0, 1, 12) - AIC:1300.6590240096739
ARIMA(0, 0, 1)x(0, 1, 0, 12) - AIC:-878.2871886214566
ARIMA(0, 0, 1)x(0, 1, 1, 12) - AIC:-904.3763290054048
{'fopt': -0.8703406587049759, 'gopt': array([-2.71550560e-06,  2.67730282e-07,  1.45785606e-04]), 'fcalls': 236, 'warnflag': 2, 'converged': False, 'iterations': 15}
ARIMA(0, 0, 1)x(1, 0, 0, 12) - AIC:-890.8730301141298
ARIMA(0, 0, 1)x(1, 0, 1, 12) - AIC:-968.6049942463142
ARIMA(0, 0, 1)x(1, 1, 0, 12) - AIC:-902.4720318218522
ARIMA(0, 0, 1)x(1, 1, 1, 12) - AIC:-903.7247525856363
ARIMA(0, 1, 0)x(0, 0, 0, 12) - AIC:-421.0708111441512
ARIMA(0, 1, 0)x(0, 0, 1, 12) - AIC:-631.6656147397348
ARIMA(0, 1, 0)x(0, 1, 0, 12) - AIC:-846.0232779798869
ARIMA(0, 1, 0)x(0, 1, 1, 12) - AIC:-1044.677329554193
ARIMA(0, 1, 0)x(1, 0, 0, 12) - AIC:-905.1880608703295
ARIMA(0, 1, 0)x(1, 0, 1, 12) - AIC:-1075.8970970289172
ARIMA(0, 1, 0)x(1, 1, 0, 12) - AIC:-956.6793202913155
ARIMA(0, 1, 0)x(1, 1, 1, 12) - AIC:-1042.6728801097784
ARIMA(0, 1, 1)x(0, 0, 0, 12) - AIC:-486.4634346036463
ARIMA(0, 1, 1)x(0, 0, 1, 12) - AIC:-657.2207725711255
ARIMA(0, 1, 1)x(0, 1, 0, 12) - AIC:-871.1945170150013
ARIMA(0, 1, 1)x(0, 1, 1, 12) - AIC:-1051.8153329958832
ARIMA(0, 1, 1)x(1, 0, 0, 12) - AIC:-915.566022989419
ARIMA(0, 1, 1)x(1, 0, 1, 12) - AIC:-1083.3602949981287
ARIMA(0, 1, 1)x(1, 1, 0, 12) - AIC:-985.2825958633645
ARIMA(0, 1, 1)x(1, 1, 1, 12) - AIC:-1049.813100118546
ARIMA(1, 0, 0)x(0, 0, 0, 12) - AIC:-420.8391663539429
ARIMA(1, 0, 0)x(0, 0, 1, 12) - AIC:-631.478827219697
ARIMA(1, 0, 0)x(0, 1, 0, 12) - AIC:-941.481693734452
ARIMA(1, 0, 0)x(0, 1, 1, 12) - AIC:-1075.7476276580237
ARIMA(1, 0, 0)x(1, 0, 0, 12) - AIC:-940.8590962219041
ARIMA(1, 0, 0)x(1, 0, 1, 12) - AIC:-1117.8047254652856
ARIMA(1, 0, 0)x(1, 1, 0, 12) - AIC:-1021.3976082948598
ARIMA(1, 0, 0)x(1, 1, 1, 12) - AIC:-1070.8598450765055
ARIMA(1, 0, 1)x(0, 0, 0, 12) - AIC:-485.15759879997995
ARIMA(1, 0, 1)x(0, 0, 1, 12) - AIC:-654.9524537203873
ARIMA(1, 0, 1)x(0, 1, 0, 12) - AIC:-936.9475467715761
ARIMA(1, 0, 1)x(0, 1, 1, 12) - AIC:-1074.1277187992307
ARIMA(1, 0, 1)x(1, 0, 0, 12) - AIC:-939.0836809880286
ARIMA(1, 0, 1)x(1, 0, 1, 12) - AIC:-1113.9487017879935
ARIMA(1, 0, 1)x(1, 1, 0, 12) - AIC:-1022.9711153116109
ARIMA(1, 0, 1)x(1, 1, 1, 12) - AIC:-1069.6431951038908
ARIMA(1, 1, 0)x(0, 0, 0, 12) - AIC:-499.20003627618775
ARIMA(1, 1, 0)x(0, 0, 1, 12) - AIC:-656.3422278672011
ARIMA(1, 1, 0)x(0, 1, 0, 12) - AIC:-857.8534866574691
ARIMA(1, 1, 0)x(0, 1, 1, 12) - AIC:-1049.6242079859908
ARIMA(1, 1, 0)x(1, 0, 0, 12) - AIC:-906.6439556046776
ARIMA(1, 1, 0)x(1, 0, 1, 12) - AIC:-1081.2426610736902
ARIMA(1, 1, 0)x(1, 1, 0, 12) - AIC:-969.8026917496442
ARIMA(1, 1, 0)x(1, 1, 1, 12) - AIC:-1047.6232469681152
ARIMA(1, 1, 1)x(0, 0, 0, 12) - AIC:-496.9651830779932
ARIMA(1, 1, 1)x(0, 0, 1, 12) - AIC:-655.6467623480365
ARIMA(1, 1, 1)x(0, 1, 0, 12) - AIC:-934.6927820836141
ARIMA(1, 1, 1)x(0, 1, 1, 12) - AIC:-1100.9027019279893
ARIMA(1, 1, 1)x(1, 0, 0, 12) - AIC:-978.0405850081135
ARIMA(1, 1, 1)x(1, 0, 1, 12) - AIC:-1132.6758448397584
ARIMA(1, 1, 1)x(1, 1, 0, 12) - AIC:-1021.4519177616553
ARIMA(1, 1, 1)x(1, 1, 1, 12) - AIC:-1099.281334548051
-904.3763290054048

Observations:

Based purely on AIC, our best parameter combination is (1, 1, 1) x (1, 0, 1, 12) with an AIC of -1132.67. Before moving forward, though, we need to check and make sure that the residuals of the mean are uncorrelated and normally distributed with a zero-mean.

In [60]:
model_SARIMA = sm.tsa.statespace.SARIMAX(Log, order = (1, 1, 1), seasonal_order = (1, 0, 1, 12))
SarimaResults = model_SARIMA.fit()
print(SarimaResults.summary())
                                     SARIMAX Results                                      
==========================================================================================
Dep. Variable:                              Value   No. Observations:                  523
Model:             SARIMAX(1, 1, 1)x(1, 0, 1, 12)   Log Likelihood                 582.824
Date:                            Sat, 22 Jan 2022   AIC                          -1155.649
Time:                                    06:49:02   BIC                          -1134.360
Sample:                                01-01-1973   HQIC                         -1147.311
                                     - 07-01-2016                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.6865      0.035     19.791      0.000       0.618       0.754
ma.L1         -0.9539      0.017    -55.312      0.000      -0.988      -0.920
ar.S.L12       0.9886      0.006    155.864      0.000       0.976       1.001
ma.S.L12      -0.7473      0.034    -21.684      0.000      -0.815      -0.680
sigma2         0.0060      0.000     17.880      0.000       0.005       0.007
===================================================================================
Ljung-Box (L1) (Q):                   0.44   Jarque-Bera (JB):                10.91
Prob(Q):                              0.51   Prob(JB):                         0.00
Heteroskedasticity (H):               1.72   Skew:                             0.09
Prob(H) (two-sided):                  0.00   Kurtosis:                         3.68
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Observations:

  • All of our model's terms have a statistically significant p-value, so we can keep all of them.
  • The p-value of the Ljung-Box test is .50, which is not statistically significant. We therefore can't reject the null hypothesis that our residuals are white noise.
  • The p-value of the heteroskedasticity is less than .05, so we reject the null hypothesis that the residuals are homoskedastic.
  • The Jarque-Bera test's p-value is statistically significant, so we reject the null hypothesis that the residuals are normally distributed. This is due to the slight positive skew and large kurtosis.
In [61]:
SarimaResults.plot_diagnostics(figsize=(15, 12))
plt.show()

Observations:

  • The standardized residual plot does not have any obvious seasonality and appears to be white noise.
  • As the summary statistics indicated, the plot of the density function has a slight positive skew as the right tail has a local maximum.
  • The Q-Q plot shows that the forecast errors slightly deviate from the red KDE line. This shows that the normal distribution is not a perfect fit for our residuals.
  • All values in the correlogram are within the confidence interval.
  • While the test statistics and plots show that our residuals do not perfectly resenble a normal distribution, it is not unreasonable. Given the excellent AIC of the SARIMA model and how slight the skew and kurtosis is for our residuals, the SARIMA model will probably give us better information and more accurate forecasts than the AR model would.

Testing the SARIMA Model.

Since we are going to test both the one-step-ahead forecast and the dynamic forecast as well as predict beyond the dates in our dataset, there are some processes we will perform repeatedly. To be more efficient, we will define functions to call later.

In [62]:
# This function gets predictions' confidence intervals, exponentiates them, plots them against the actual values, and prints the RMSE and MSE.
def PredResults(pred, errors = True):
  # Get the fitted predictions.
  predictions_SARIMAX = np.exp(pred.conf_int())

  #Plot predictions agains values.
  ax = NatGas['1973':].plot(label='observed',figsize=(24, 8))
  np.exp(pred.predicted_mean).plot(label='Forecast', ax=ax, color = 'r')
  ax.fill_between(predictions_SARIMAX.index,
                predictions_SARIMAX.iloc[:, 0],
                predictions_SARIMAX.iloc[:, 1], color='y', alpha=.5)
  ax.set_xlabel('Time (years)')
  ax.set_ylabel('NG CO2 Emissions')
  plt.legend()
  plt.show();

  # Print the MSE and RMSE for predictions. If a forecast, check the p-values.
  if errors == True:
    NatGas_forecast = np.exp(pred.predicted_mean)
    NatGas_truth = NatGas.Value.iloc[418:523]
    mse = ((NatGas_forecast - NatGas_truth) ** 2).mean()
    print('The Mean Squared Error (MSE) of the forecast is {}'.format(round(mse, 2)))
    print('The Root Mean Square Error (RMSE) of the forecast: {:.4f}'
        .format(np.sqrt(sum((NatGas_forecast-NatGas_truth)**2)/len(NatGas_forecast))))
    
  else:
    print(SarimaResults.pvalues)
  
  return(pred, predictions_SARIMAX);
In [63]:
# Get predictions with a one-step-ahead forecast.
OneStep = PredResults(SarimaResults.get_prediction(start = 418, end = 522, dynamic=False))
The Mean Squared Error (MSE) of the forecast is 4.4
The Root Mean Square Error (RMSE) of the forecast: 2.0979

Observations:

While the RMSE is higher for the SARIMA model than it was for the AR model, we can clearly see from the graph that the SARIMA model did a much better job capturing the magnitude of the data's seasonality. Maybe a dynamic model will perform better.

In [64]:
# Get predictions with a dynamic forecast.
Dynamic = PredResults(SarimaResults.get_prediction(start = 418, end = 522, dynamic=True))
The Mean Squared Error (MSE) of the forecast is 18.09
The Root Mean Square Error (RMSE) of the forecast: 4.2535

Observations:

The dynamic SARIMA's MSE and RMSE are 4 times and 2 times greater than those of the one-step-ahead forecast, respectively. We will therefore use the one-step-ahead forecast to predict the next 2 years.

In [65]:
# Forecast the next two years with a one-step-ahead forecast.
Forecast = PredResults(SarimaResults.get_forecast(steps = 24, dynamic=False), False)
ar.L1        3.567778e-87
ma.L1        0.000000e+00
ar.S.L12     0.000000e+00
ma.S.L12    2.899682e-104
sigma2       1.695025e-71
dtype: float64

Module 2 Conclusion

Final Solution Design:

  • We have tested the CO2 emissions forecasting accuracy of the AR, MA, ARMA, ARIMA, and SARIMA methods.
  • SARIMA's low RMSE, very low AIC, fulfillment of all necessary statistical assumptions, and ability to model the data's seasonality make it the best model out of those tested.
  • SARIMA is the preferred model. The RMSE and the Jaque-Bera test both suggest that more refinement beyond the scope of this project could result in an even better model.
  • We have forecasted the next 24 months of CO2 emissions data.

Refined Insights:

  • Natural gas CO2 emissions have a very strong annual seasonality.
  • Coal currently emits more CO2 than any other energy source, including natural gas.
  • This may soon change as coal's CO2 emissions started declining around 2008 and natural gas's CO2 emissions began increasing around 2000. If coal were to continue decreasing from its CO2 emissions of approximately 100 MMT and the forecast for natural gas is accurate, it is highly possible natural gas will become the greatest CO2 contributor.
  • We failed to reject the null hypothesis that the series is stationary due to its variation in standard deviation and mean over time. We therefore need to be cautious and not assume that natural gas's emissions are dependent on coal's when it could merely be dependent on time.
  • It would behoove future researchers to take into account the average Watt to ton ratio of each of the energy sources and who is using which. It could be better or worse for the environment that natural gas appears to be slowly replacing coal.